Automated Transport Mode Detection of GPS Tracking Data
Authors
Cyril Geistlich, Micha Franz
Abstract
This project investigates the key factors and features in GPS tracking data that differentiate modes of transportation. Machine learning is applied to automate transportation mode detection using spatial, temporal, and attribute analysis, and the results are verified manually to ensure accuracy. The findings contribute to computational movement analysis and automated transportation mode detection.
1. Introduction
In recent years, the spread of GPS-enabled devices and progress in location-based technologies have generated vast amounts of GPS tracking data. This data holds significant potential for extracting insights and improving our understanding of human mobility patterns. One main application in this field is the differentiation of transportation modes, which can benefit various domains such as traffic management or urban planning. Determining the mode of transportation from GPS tracking data presents several challenges. With the ubiquitous increase of GPS tracking through smartphones and other devices, manually annotating the data is too time-consuming and expensive, and also prone to human error and bias. This leads to the following two research questions:
What are the key factors and features that can be extracted from GPS tracking data to differentiate between different types of transportation modes?
How can machine learning techniques be applied to GPS tracking data to automate the detection of the mode of transportation and which accuracies can be achieved by different machine learning algorithms?
The project will focus on exploring spatial and temporal aspects to extract key factors from GPS tracking data, such as velocity, sinuosity, or turning angles. Additionally, spatial context in the form of traffic networks and land cover is added to the data in order to improve the accuracy of transportation mode detection. Machine learning algorithms will be tested and employed to automate the classification of transportation modes. The aim is to find an accurate algorithm by training and evaluating different models on labeled data; candidate models include random forests, support vector machines, and neural networks. To ensure the accuracy of the models, a subset of the classified data is used to validate the performance. By comparing the results of the automated classification with ground truth data, the project aims to assess the accuracies achieved by different machine learning algorithms and to identify areas for improvement.
2. Data
The main data are the GPS tracking data, which were recorded with the Posmo app on smartphones over a time span of approximately 1.5 months by the two authors, complemented by data from the available data pool. The complete collected data was manually labelled to ensure a valid ground truth. Further, spatial context data was used: the Swiss road network, tram network,1 train network2 and the bus networks of the cantons of Zurich3 and Bern.4 (Note: according to the federal bureau of transport, there is no official data set for the entire Swiss bus network. Thus the available data sets for Bern and Zurich were used, where a significant number of data points pertaining to bus usage were collected.) To facilitate the detection of the transport mode boat, land cover data containing all Swiss waters was also used.5
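The exact file names and formats of the context data sets are not part of this document; as a sketch, the networks and water bodies could be loaded and projected to LV95 (EPSG:2056, the CRS used throughout the analysis) as follows. All paths and layer names below are placeholders, not the actual files.

```r
library(sf)
library(dplyr)

# hypothetical paths; the real file names depend on the downloaded data sets
tram_netz <- st_read("data/networks/tram_netz.gpkg") %>% st_transform(2056)
zug_netz  <- st_read("data/networks/zug_netz.gpkg")  %>% st_transform(2056)
bus_netz  <- st_read("data/networks/bus_netz.gpkg")  %>% st_transform(2056)
strassen  <- st_read("data/networks/strassen.gpkg")  %>% st_transform(2056)
gewaesser <- st_read("data/landcover/gewaesser.gpkg") %>% st_transform(2056)
```

Projecting everything to a single planar CRS up front keeps the later `st_distance()` calls consistent and in metres.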
Code
library("dplyr")
library("sf")
library("readr")
library("ggplot2")
library("mapview")
library("lubridate")
library("zoo")
library("caret")
library("LearnGeom")    # to calculate angles
library("geosphere")    # to calculate distances
library("RColorBrewer") # to create custom color palettes
library("ggcorrplot")   # for correlation matrix
library("ROSE")
library("gridExtra")
Code
# creates lines out of points, used for visualisation purposes
point2line <- function(points) {
  geometries <- st_cast(st_geometry(points %>% select(geometry)), "POINT")
  n <- length(geometries) - 1
  linestrings <- lapply(X = 1:n, FUN = function(x) {
    pair <- st_combine(c(geometries[x], geometries[x + 1]))
    line <- st_cast(pair, "LINESTRING")
    return(line)
  })
  multilinestring <- st_multilinestring(do.call("rbind", linestrings))
  df <- data.frame(linestrings[1])
  for (i in 2:length(linestrings)) {
    temp <- data.frame(linestrings[i])
    df <- rbind(df, temp)
  }
  sf_lines <- df %>% st_as_sf()
}

un_col <- function(df) {
  return(length(unique(df)))
}
Code
# read personal tracking data
posmo_micha_truth_csv <- read.delim("data/manually_labelled/posmo_20230502_to_20230613_m.csv", sep = ",")
posmo_cyril_truth_csv <- read.delim("data/manually_labelled/posmo_2023-05-01T00_00_00+02_00-2023-06-26T23_59_59+02_00.csv", sep = ",")
posmo_micha_csv <- read.delim("data/posmo_labelled/posmo_20230502_to_20230613_p.csv", sep = ",")

# read tracking data from pool
posmo_pool_1 <- read.delim("data/manually_labelled/posmo.csv", sep = ",") %>%
  tail(612) # last 250 data points are not correctly labelled
posmo_pool_2 <- read.delim("data/manually_labelled/posmo_2.csv", sep = ",")
posmo_pool_3 <- read.delim("data/manually_labelled/posmo_BuJa.csv", sep = ",")
# function with data cleaning steps
process_posmo_data <- function(posmo_data) {
  # Convert to sf object
  posmo_data <- posmo_data %>%
    st_as_sf(coords = c("lon_x", "lat_y"), crs = 4326) %>%
    st_transform(crs = 2056)
  # Remove unwanted columns
  posmo_data <- posmo_data[, -c(1, 3, 4)]
  # Fix timestamp
  posmo_data$datetime <- ymd_hms(posmo_data$datetime) + hours(2)
  # Add ID to rows
  posmo_data <- posmo_data %>% mutate(id = row_number())
  # Remove duplicate time values
  posmo_data <- posmo_data[!duplicated(posmo_data$datetime), ]
  # Remove subsequent duplicate locations (person wasn't moving)
  posmo_data <- posmo_data %>% filter(geometry != lead(geometry))
  return(posmo_data)
}
3. Methods
3.1 Preprocessing
When tracking a person throughout the day using GPS data, there are instances where the person appears to be stationary, such as when in an office or at a university. However, due to GPS inaccuracies, these stationary points may not appear at the exact same location and can exhibit erratic movement patterns. The accuracy of GPS signals is often compromised in densely built-up areas, amplifying this phenomenon. Figure XXX (screenshot to be added) shows an example of this phenomenon around the Irchel campus of the University of Zurich. As a result, parameters like velocity and step length can show values that are typically associated with other categories. To address this issue, two approaches were employed.
The first approach involves analyzing the angles between consecutive points. Typically, these angles are significantly smaller for stationary points than for other movements. By visually determining a threshold angle, the data set is filtered to remove all data points with angles smaller than 60°. This process needs to be repeated iteratively until no angles below the threshold remain, as the removal of data points alters the angles between the remaining points. Figure XXX (screenshot to be added) shows several iterations, removing more and more points with an angle below the threshold. One problem of this approach is that in some cases small angles can also emerge naturally rather than from an error; these points are then falsely removed. A special case of this problem arises at U-turns or sharp changes in direction: the angle between the points just before and after the turn may indeed be small. Consequently, the removal of these points leads to a re-calculation of angles, which can result in the subsequent removal of additional points and the loss of significant segments. Figure XXX demonstrates this phenomenon. However, visual inspection of a representative amount of data showed that this occurs only rarely.
The second approach considers the distance between the current point and a set number of preceding and consecutive points. A point is deemed static if the maximum distance between that point and any of the set number of preceding or consecutive points falls below a predefined distance threshold. However, this approach may unintentionally remove non-static data points, particularly when a person is walking slowly and numerous data points are recorded within a small distance. Adjusting the distance threshold or the number of preceding and consecutive points can mitigate this issue, but it requires striking a balance between filtering out false movements and retaining genuine data. The sampling rate of Posmo was set to 10 or 15 seconds, but in some cases data points were recorded every three seconds, which obviously increases the chances of removing data in the way just described. Since this behavior was not expected and only discovered late in the process, the points exhibiting an abnormally short sampling interval were not removed prior to preprocessing.
Finding the optimal compromise between these filtering approaches involves considering the specific characteristics of the tracked person’s movements and the quality of the GPS data. By iteratively applying the angle-based filtering and analyzing the distance to surrounding points, a more accurate identification of stationary periods can be achieved, mitigating the impact of GPS inaccuracies and preserving the integrity of the tracking data. Thus, the thresholds were set by trial and error.
Code
filterStaticByDistance <- function(data, threshold_distance, consecutive_points) {
  require(geosphere)
  # transform to WGS84, necessary to calculate distances using geosphere
  data <- data %>% st_transform(4326)
  # Extract coordinates from the geometry
  coords <- data.frame(st_coordinates(data))
  data$longitude <- coords$X
  data$latitude <- coords$Y
  # Calculate distances to preceding and consecutive points
  distances <- numeric(nrow(data))
  for (i in (consecutive_points + 1):(nrow(data) - consecutive_points)) {
    next_points <- coords[(i + 1):(i + consecutive_points), ]
    prev_points <- coords[(i - 1):(i - consecutive_points), ]
    all_points <- rbind(next_points, prev_points)
    distances[i] <- max(geosphere::distGeo(coords[i, ], all_points))
  }
  # Remove points whose maximum distance falls below the threshold (static points);
  # keep the first/last values, whose distance was left at 0
  filtered_data <- data[distances >= threshold_distance | distances == 0, ]
  # Transform back to LV95
  filtered_data <- filtered_data %>% st_transform(2056)
  # distances are returned only for testing threshold values
  return(list(filtered_data = filtered_data, distances = distances))
}
Code
getAngle <- function(coords) {
  angles <- numeric(nrow(coords)) # Initialize angles as a numeric vector
  angles[1] <- NA # the first point can't have an angle
  for (i in 2:(nrow(coords) - 1)) {
    # calculate the angle for 3 consecutive points, similar to lag/lead
    angle <- Angle( # function from the LearnGeom library
      c(coords[i - 1, "X"], coords[i - 1, "Y"]),
      c(coords[i, "X"], coords[i, "Y"]),
      c(coords[i + 1, "X"], coords[i + 1, "Y"])
    )
    angles[i] <- angle # Assign the calculated angle to the corresponding index
  }
  angles[nrow(coords)] <- NA # the last point can't have an angle
  return(angles)
}
Code
filterStaticByAngle <- function(working_dataset, angleThreshold) {
  coords <- data.frame(st_coordinates(working_dataset), working_dataset$id)
  working_dataset$angle <- getAngle(coords)
  min_angle <- min(working_dataset$angle, na.rm = TRUE)
  # iteratively filter out tight angles until none below the threshold are left
  while (min_angle <= angleThreshold) {
    working_dataset <- working_dataset %>%
      filter(is.na(angle) | angle > angleThreshold) # keep first and last value (= NA)
    coords <- data.frame(st_coordinates(working_dataset), working_dataset$id)
    working_dataset$angle <- getAngle(coords)
    min_angle <- min(working_dataset$angle, na.rm = TRUE)
  }
  return(working_dataset)
}
Code
# result <- filterStaticByDistance(working_dataset, threshold_distance = 60, consecutive_points = 5)
# filteredByDistance <- result$filtered_data
# working_dataset$distances <- result$distances # just for testing threshold values
# filteredByAngle <- filterStaticByAngle(working_dataset, 60)
# filteredByDistance_and_Angle <- filterStaticByAngle(filteredByDistance, 60)
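The movement parameters referenced in the following sections (`diff_s`, `velocity`, `acceleration`, and their moving-window means) are not derived in any chunk shown here. A minimal sketch of how they could be computed from consecutive fixes of the cleaned sf object is given below; the window width of 5 and the exact column naming are assumptions chosen to match the names used later.

```r
library(dplyr)
library(sf)
library(zoo)

working_dataset <- working_dataset %>%
  mutate(
    # distance and time to the next fix
    steplength = as.numeric(st_distance(geometry, lead(geometry), by_element = TRUE)),
    diff_s     = as.numeric(difftime(lead(datetime), datetime, units = "secs")),
    # point-to-point velocity [m/s] and acceleration [m/s^2]
    velocity     = steplength / diff_s,
    acceleration = (lead(velocity) - velocity) / diff_s,
    # moving-window means smooth out single-fix GPS noise (window width of 5 is an assumption)
    diff_s_mean       = rollmean(diff_s, k = 5, fill = NA),
    velocity_mean     = rollmean(velocity, k = 5, fill = NA),
    acceleration_mean = rollmean(acceleration, k = 5, fill = NA)
  )
```

Centred rolling means (the `zoo::rollmean` default) assign each point the average over its neighbourhood, which is consistent with the "moving window" parameters discussed in the threshold sections.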
Due to similar movement parameters across transportation modes, automatically classifying transportation modes using only movement parameters is a particularly challenging task. Bus and tram in cities, for example, exhibit very similar characteristics. To facilitate the classification task, the data was enriched with spatial context data in the form of various networks and land cover, as mentioned in the data description. By incorporating this spatial context data, the classification process can be enhanced by considering the surrounding environment in which the transportation modes operate. For every data point, the closest distance to the different networks and water bodies was calculated. In some cases, the calculated distances to these networks or water bodies could be extremely large. Including such large values in the data set would lead to a significant span of values, potentially overshadowing smaller differences within cities. To avoid this issue, a maximum distance of 100 m was set: any distance beyond 100 m was assigned a value of 100 m. This threshold treats all distances beyond 100 m as equal, effectively reducing the influence of extremely large distances on the classification task.
It is important to note that the distance calculation may not always provide an accurate representation of real distances, especially in cases involving tunnels or underground passages with overlaying data points. An example of this can be seen below, where a tunnel passes close underneath the house of one of the authors and wrong proximities are calculated.
Code
# since some of the networks are extremely large data sets, a buffer around all
# data points is intersected with the networks; only the intersecting network
# segments are used to calculate the distances
data_AOI <- st_buffer(working_dataset, 50) %>% st_union()

tram_netz_AOI <- st_intersection(tram_netz, data_AOI)
working_dataset$distance_tram <- as.numeric(st_distance(working_dataset, tram_netz_AOI))
working_dataset$distance_tram <- ifelse(working_dataset$distance_tram > 100, 100, working_dataset$distance_tram)

zug_netz_AOI <- st_intersection(zug_netz, data_AOI)
working_dataset$distance_zug <- as.numeric(st_distance(working_dataset, zug_netz_AOI))
working_dataset$distance_zug <- ifelse(working_dataset$distance_zug > 100, 100, working_dataset$distance_zug)

gewaesser_AOI <- st_intersection(gewaesser, data_AOI)
working_dataset$distance_gewaesser <- as.numeric(st_distance(working_dataset, gewaesser_AOI))
working_dataset$distance_gewaesser <- ifelse(working_dataset$distance_gewaesser > 100, 100, working_dataset$distance_gewaesser)

bus_netz_AOI <- st_intersection(bus_netz, data_AOI)
working_dataset$distance_bus <- as.numeric(st_distance(working_dataset, bus_netz_AOI))
working_dataset$distance_bus <- ifelse(working_dataset$distance_bus > 100, 100, working_dataset$distance_bus)

working_dataset$distance_strasse <- as.numeric(st_distance(working_dataset, strassen))
working_dataset$distance_strasse <- ifelse(working_dataset$distance_strasse > 100, 100, working_dataset$distance_strasse)
# Convert infinite sinuosity values to NA, then remove all rows with missing values
working_dataset$sinuosity[is.infinite(working_dataset$sinuosity)] <- NA
working_dataset <- na.omit(working_dataset)
posmo_pool$sinuosity[is.infinite(posmo_pool$sinuosity)] <- NA
posmo_pool <- na.omit(posmo_pool)
Variable Correlation
The correlation matrix shows the pairwise correlation between variables. The computed velocities and accelerations correlate strongly; the other variables show only little correlation.
Code
# select columns with the relevant variables and standardize them
standardized <- working_dataset[, 6:15] %>%
  st_drop_geometry() %>%
  scale(center = TRUE, scale = TRUE) %>%
  as.data.frame()
corr_matrix <- cor(standardized)
ggcorrplot(corr_matrix)
# Save full data sets as csv
working_dataset <- st_drop_geometry(working_dataset)
posmo_pool <- st_drop_geometry(posmo_pool)
write.csv(working_dataset, file = "data/full_working_dataset.csv", row.names = FALSE)
write.csv(posmo_pool, file = "data/full_posmo_pool_dataset.csv", row.names = FALSE)
3.4 Class Distribution Overview
Code
working_dataset <- read.delim("data/full_working_dataset.csv", sep = ",", header = TRUE)
posmo_pool <- read.delim("data/full_posmo_pool_dataset.csv", sep = ",", header = TRUE)
working_dataset <- rbind(working_dataset, posmo_pool)
working_dataset <- na.omit(working_dataset)

# Show class distribution
ggplot(working_dataset) +
  geom_bar(aes(x = transport_mode)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "transparent", color = NA)) +
  ggtitle("Class Distribution over Unfiltered Data Set") +
  xlab("Transport Mode") +
  ylab("Count")
The distribution shows that many classes are very poorly represented in the data. Unclassified data is removed, and the underrepresented transport modes are aggregated into the class “Other”.
Bike Boat Bus Car Horse Other Train Tram Walk
2120 417 4100 7272 3866 468 18577 4732 10283
The dotted red line lies at a count of 500, representing the desired sample count for the following under sampling of our data set.
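The aggregation step itself is not shown in any chunk; a sketch of how the rare classes could be folded into “Other” is given below. The label "Unclassified" and the exact set of classes kept are assumptions based on the counts in the table above.

```r
library(dplyr)

kept_classes <- c("Bike", "Boat", "Bus", "Car", "Horse", "Train", "Tram", "Walk")

working_dataset <- working_dataset %>%
  filter(transport_mode != "Unclassified") %>%  # drop unclassified points (label is an assumption)
  mutate(transport_mode = ifelse(transport_mode %in% kept_classes,
                                 transport_mode,
                                 "Other"))      # fold rare labels into "Other"
```

After this step, `table(working_dataset$transport_mode)` should reproduce the nine classes listed above.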
3.5 Sampling Interval
The sampling intervals were found to be highly inconsistent. Many large sampling intervals originate from the tracked person being stationary. Therefore, the sampling interval is limited to 60 seconds. No re-sampling to equalize the sampling interval is undertaken, in order to preserve the GPS positions and the calculated parameters for each data point; with large sampling intervals the calculated movement parameters become inaccurate and unrepresentative of the transport mode. After applying the threshold, the actual sampling interval of 10 or 15 seconds, respectively, can be seen in the box plot.
Code
boxplot_diff_s <- ggplot(working_dataset, aes(x = transport_mode, y = diff_s)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "transparent", color = NA)) +
  ylab("sample interval [s]") +
  xlab("Transport Mode") +
  ggtitle("Sample Interval per Class")

# Set threshold for parameters
working_dataset <- working_dataset[working_dataset$diff_s < 60, ]

boxplot_diff_s_after <- ggplot(working_dataset, aes(x = transport_mode, y = diff_s)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "transparent", color = NA)) +
  ylab("sample interval [s]") +
  xlab("Transport Mode") +
  ggtitle("Sample Interval per Class \nAfter Threshold")

# Display the plots side by side
grid.arrange(boxplot_diff_s, boxplot_diff_s_after, nrow = 1)
After the initial removal of sampling intervals larger than 60 seconds, we repeat the step for the moving-window sampling intervals.
Code
boxplot_diff_s_mean <- ggplot(working_dataset, aes(x = transport_mode, y = diff_s_mean)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "transparent", color = NA)) +
  ylab("sample interval [s]")

# Set threshold for parameters
working_dataset <- working_dataset[working_dataset$diff_s_mean < 60, ]

boxplot_diff_s_mean_after <- ggplot(working_dataset, aes(x = transport_mode, y = diff_s_mean)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "transparent", color = NA)) +
  ylab("sample interval [s]")

# Display the plots side by side
grid.arrange(boxplot_diff_s_mean, boxplot_diff_s_mean_after, nrow = 1)
3.6 Parameter Thresholds
3.6.1 Velocity
The velocity attribute shows some outliers for the train and walking classes. The threshold for the maximum velocity is set to 55.55 m/s (200 km/h), as no transport mode in our analysis is expected to exceed such a velocity. One exception is air travel, but with only very few data points there is no benefit in including higher velocities. After setting the threshold, some obvious outliers remain for the walking class. Reasons for such outliers in the calculated velocity could be:
Wrong classification, even though the data was verified.
GPS inaccuracies, where the GPS point location is “jumping” creating very inaccurate, zigzagging tracking data.
Code
boxplot_velocity <- ggplot(working_dataset, aes(x = transport_mode, y = velocity)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("velocity [m/s]") +
  xlab("Transport Mode") +
  ggtitle("Velocity per Class")

# Set threshold for parameters
working_dataset <- working_dataset[working_dataset$velocity < 55.55, ]

boxplot_velocity_after <- ggplot(working_dataset, aes(x = transport_mode, y = velocity)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("velocity [m/s]") +
  xlab("Transport Mode") +
  ggtitle("Velocity per Class After Threshold")

# Display the plots side by side
grid.arrange(boxplot_velocity, boxplot_velocity_after, nrow = 1)
3.6.2 Moving Window Velocity
The moving window velocity shows less extreme outliers. The number of outliers can be reduced further by setting the threshold to 55.55 m/s (200 km/h). After applying the threshold, classes with similar average velocities can be identified. This might already indicate classes that are difficult to distinguish using classification methods.
Code
boxplot_velocity_mean <- ggplot(working_dataset, aes(x = transport_mode, y = velocity_mean)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("moving window velocity [m/s]")

# Set threshold for parameters
working_dataset <- working_dataset[working_dataset$velocity_mean < 55.55, ]

boxplot_velocity_mean_after <- ggplot(working_dataset, aes(x = transport_mode, y = velocity_mean)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("moving window velocity [m/s]")

# Display the plots side by side
grid.arrange(boxplot_velocity_mean, boxplot_velocity_mean_after, nrow = 1)
3.6.3 Acceleration
The acceleration threshold is set to 10 m/s^2, which for this classification is considered the maximum possible acceleration across all classes. The distribution of the classes is similar to the velocities. The parameter correlation analysis found a strong correlation between velocity and acceleration.
Code
boxplot_acceleration <- ggplot(working_dataset, aes(x = transport_mode, y = acceleration)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "transparent", color = NA)) +
  ylab("acceleration [m/s^2]")

# Set threshold for parameters
working_dataset <- working_dataset[working_dataset$acceleration < 10, ]

boxplot_acceleration_after <- ggplot(working_dataset, aes(x = transport_mode, y = acceleration)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("acceleration [m/s^2]")

# Display the plots side by side
grid.arrange(boxplot_acceleration, boxplot_acceleration_after, nrow = 1)
3.6.4 Moving Window Acceleration
The moving window acceleration threshold is set to \(10 m/s^2\), as for the single-point acceleration values.
Code
boxplot_acceleration_mean <- ggplot(working_dataset, aes(x = transport_mode, y = acceleration_mean)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "transparent", color = NA)) +
  ylab("moving window acceleration [m/s^2]")

# Set threshold for parameters
working_dataset <- working_dataset[working_dataset$acceleration_mean < 10, ]

boxplot_acceleration_mean_after <- ggplot(working_dataset, aes(x = transport_mode, y = acceleration_mean)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("moving window acceleration [m/s^2]")

# Display the plots side by side
grid.arrange(boxplot_acceleration_mean, boxplot_acceleration_mean_after, nrow = 1)
3.7 Under Sampling
The data set is strongly imbalanced. To improve model accuracy, we use under sampling to balance the classes, with a desired 500 samples per class. The classes “Boat” and “Other” do not have sufficient points. The sample size is not decreased further, so that enough data remains to train and test the computed models.
Code
# Create copy for later use
working_dataset_full <- working_dataset

# Set the maximum number of entries per class
max_entries <- 500

# Perform under sampling
working_dataset <- working_dataset |>
  group_by(transport_mode) |>
  sample_n(min(n(), max_entries)) |>
  ungroup()

# Check the resulting undersampled data frame
table(working_dataset$transport_mode)
Bike Boat Bus Car Horse Other Train Tram Walk
500 352 500 500 500 395 500 500 500
3.8 Classification
To classify the data, Support Vector Machines (SVM) are applied: a linear, a radial, and a polynomial SVM are tested. We apply both a single train-test split and a 10-fold cross validation with 3 repeats. The cross validation improves model robustness compared to the single train-test split and reduces bias, resulting in a more representative evaluation of the model performance. The tuning sequences are replaced by the best hyperparameters found for each model, to save computation time.
The models are evaluated with the confusion matrix, the overall accuracy, recall, precision, and F1-Score. A confusion matrix is a table that summarizes the performance of a classification model by showing the counts of true positive, true negative, false positive, and false negative predictions. Precision measures the proportion of correctly predicted positive instances out of the total instances predicted as positive. Recall measures the proportion of correctly predicted positive instances out of the total actual positive instances. The F1-score combines precision and recall into a single metric. It provides a balance between precision and recall and is useful when both false positives and false negatives are important.
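To illustrate these metrics, the per-class values can be derived directly from a confusion matrix table. The small 3×3 matrix below is invented purely for the example; caret's `confusionMatrix()` reports the same quantities automatically.

```r
# invented toy confusion matrix: rows = predictions, columns = reference
cm <- matrix(c(50, 10,  5,
                8, 40,  2,
                2,  5, 45),
             nrow = 3, byrow = TRUE,
             dimnames = list(pred = c("Bus", "Tram", "Walk"),
                             ref  = c("Bus", "Tram", "Walk")))

precision <- diag(cm) / rowSums(cm)  # TP / (TP + FP)
recall    <- diag(cm) / colSums(cm)  # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- sum(diag(cm)) / sum(cm)

round(rbind(precision, recall, f1), 3)
```

Note that caret labels recall as "Sensitivity" and precision as "Pos Pred Value" in the by-class statistics printed below.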
Code
# Define control for 10-fold CV
fitControl <- trainControl(
  method = "repeatedcv", # 10-fold CV
  number = 10,
  repeats = 3
)
We create a training and a test data set. The training data set contains 80% of the data points and the test set contains 20% of the data points.
Code
# Convert to factor
working_dataset$transport_mode <- as.factor(working_dataset$transport_mode)

# Create training and test data set
TrainingIndex <- createDataPartition(working_dataset$transport_mode, p = 0.8, list = FALSE)
TrainingSet <- working_dataset[TrainingIndex, ]
TestingSet <- working_dataset[-TrainingIndex, ]
3.8.1 Linear SVM
A linear support vector machine is tested and its performance evaluated. Different hyperparameter settings were tested to find the best model. For the linear SVM, the best fit was found for C = 3, achieving an overall accuracy of approximately 78%. Precision, recall and F1-score vary between the classes but average around 78-79%.
Code
# Set seed for reproducibility
set.seed(100)

# Perform linear SVM
model.svmL <- train(
  transport_mode ~ ., data = TrainingSet,
  method = "svmLinear",
  na.action = na.omit,
  preProcess = c("scale", "center"),
  trControl = trainControl(method = "none"),
  tuneGrid = data.frame(C = 3)
)

# Perform linear SVM with 10-fold cross validation (reduce length for shorter computation time)
model.svmL.cv <- train(
  transport_mode ~ ., data = TrainingSet,
  method = "svmLinear",
  na.action = na.omit,
  preProcess = c("scale", "center"),
  trControl = fitControl,
  tuneGrid = expand.grid(C = seq(3, 6, length = 4)) # Find best fit model
)

# Show best tune
# print(model.svmL.cv$bestTune)

# Make predictions
model.svmL.training <- predict(model.svmL, TrainingSet)
model.svmL.testing <- predict(model.svmL, TestingSet)
model.svmL.cv.training <- predict(model.svmL.cv, TrainingSet)
model.svmL.cv.testing <- predict(model.svmL.cv, TestingSet)

# Model performance
model.svmL.training.confusion <- confusionMatrix(model.svmL.training, as.factor(TrainingSet$transport_mode))
model.svmL.testing.confusion <- confusionMatrix(model.svmL.testing, as.factor(TestingSet$transport_mode))
model.svmL.cv.training.confusion <- confusionMatrix(model.svmL.cv.training, as.factor(TrainingSet$transport_mode))
(model.svmL.cv.testing.confusion <- confusionMatrix(model.svmL.cv.testing, as.factor(TestingSet$transport_mode))) # Print test run with CV
Confusion Matrix and Statistics
Reference
Prediction Bike Boat Bus Car Horse Other Train Tram Walk
Bike 250 0 64 26 0 6 3 11 1
Boat 0 282 0 0 0 0 0 0 0
Bus 35 0 167 11 0 9 2 3 10
Car 7 0 27 345 1 15 0 0 1
Horse 7 0 5 8 396 10 0 0 103
Other 15 0 7 5 2 200 0 1 15
Train 0 0 3 2 0 0 392 1 0
Tram 78 0 95 1 0 36 2 361 17
Walk 8 0 32 2 1 40 1 23 253
Overall Statistics
Accuracy : 0.7787
95% CI : (0.7644, 0.7926)
No Information Rate : 0.1177
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7505
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse
Sensitivity 0.62500 1.00000 0.41750 0.8625 0.9900
Specificity 0.96298 1.00000 0.97665 0.9830 0.9556
Pos Pred Value 0.69252 1.00000 0.70464 0.8712 0.7486
Neg Pred Value 0.95061 1.00000 0.92629 0.9817 0.9986
Prevalence 0.11772 0.08299 0.11772 0.1177 0.1177
Detection Rate 0.07357 0.08299 0.04915 0.1015 0.1165
Detection Prevalence 0.10624 0.08299 0.06975 0.1165 0.1557
Balanced Accuracy 0.79399 1.00000 0.69708 0.9227 0.9728
Class: Other Class: Train Class: Tram Class: Walk
Sensitivity 0.63291 0.9800 0.9025 0.63250
Specificity 0.98540 0.9980 0.9236 0.96431
Pos Pred Value 0.81633 0.9849 0.6119 0.70278
Neg Pred Value 0.96321 0.9973 0.9861 0.95161
Prevalence 0.09300 0.1177 0.1177 0.11772
Detection Rate 0.05886 0.1154 0.1062 0.07446
Detection Prevalence 0.07210 0.1171 0.1736 0.10594
Balanced Accuracy 0.80916 0.9890 0.9131 0.79840
Code
# Precision for each class
cat("\nPrecision for each class:\n")
# Save the models
saveRDS(model.svmL, "models/model_svmL.rds")
saveRDS(model.svmL.cv, "models/model_svmL_cv.rds")
3.8.2 Radial Support Vector Machine
The radial SVM performs slightly better than the linear SVM, with an overall accuracy of 79.27% and similar recall, precision and F1-scores. It also performs more consistently, since the applied metrics vary less between the classes.
Code
# Set seed for reproducibility
set.seed(108)

# Build training model
model.svmRadial <- train(
  transport_mode ~ ., data = TrainingSet,
  method = "svmRadial",
  na.action = na.omit,
  preProcess = c("scale", "center"),
  trControl = trainControl(method = "none"),
  tuneGrid = expand.grid(sigma = 0.8683492, C = 5)
)

# Build CV model (long processing!)
TrainingSet$transport_mode <- as.character(TrainingSet$transport_mode)
model.svmRadial.cv <- train(
  transport_mode ~ ., data = TrainingSet,
  method = "svmRadial",
  na.action = na.omit,
  preProcess = c("scale", "center"),
  trControl = fitControl,
  tuneGrid = expand.grid(sigma = 0.8683492, C = 5)
)
(model.svmRadial.cv$bestTune)
sigma C
1 0.8683492 5
Code
# Make predictions
model.svmRadial.training <- predict(model.svmRadial, TrainingSet)
model.svmRadial.testing <- predict(model.svmRadial, TestingSet)

# Make predictions from the cross validation model
model.svmRadial.cv.training <- predict(model.svmRadial.cv, TrainingSet)
model.svmRadial.cv.testing <- predict(model.svmRadial.cv, TestingSet)

# Model performance
model.svmRadial.training.confusion <- confusionMatrix(model.svmRadial.training, as.factor(TrainingSet$transport_mode))
model.svmRadial.testing.confusion <- confusionMatrix(model.svmRadial.testing, as.factor(TestingSet$transport_mode))
model.svmRadial.cv.confusion <- confusionMatrix(model.svmRadial.cv.training, as.factor(TrainingSet$transport_mode))
(model.svmRadial.cv.testing.confusion <- confusionMatrix(model.svmRadial.cv.testing, as.factor(TestingSet$transport_mode))) # Print test run with CV
Confusion Matrix and Statistics
Reference
Prediction Bike Boat Bus Car Horse Other Train Tram Walk
Bike 74 0 11 10 1 2 0 5 0
Boat 0 65 0 0 0 0 0 0 0
Bus 11 0 65 6 1 1 1 7 2
Car 3 0 5 77 0 1 0 0 2
Horse 1 0 0 1 93 2 0 0 24
Other 5 0 2 1 1 60 0 1 4
Train 3 5 2 4 0 5 99 4 5
Tram 3 0 10 0 0 3 0 83 6
Walk 0 0 5 1 4 5 0 0 57
Overall Statistics
Accuracy : 0.7927
95% CI : (0.7638, 0.8195)
No Information Rate : 0.1178
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7663
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse
Sensitivity 0.74000 0.92857 0.65000 0.77000 0.9300
Specificity 0.96128 1.00000 0.96128 0.98531 0.9626
Pos Pred Value 0.71845 1.00000 0.69149 0.87500 0.7686
Neg Pred Value 0.96515 0.99362 0.95364 0.96978 0.9904
Prevalence 0.11779 0.08245 0.11779 0.11779 0.1178
Detection Rate 0.08716 0.07656 0.07656 0.09069 0.1095
Detection Prevalence 0.12132 0.07656 0.11072 0.10365 0.1425
Balanced Accuracy 0.85064 0.96429 0.80564 0.87766 0.9463
Class: Other Class: Train Class: Tram Class: Walk
Sensitivity 0.75949 0.9900 0.83000 0.57000
Specificity 0.98182 0.9626 0.97063 0.97997
Pos Pred Value 0.81081 0.7795 0.79048 0.79167
Neg Pred Value 0.97548 0.9986 0.97715 0.94466
Prevalence 0.09305 0.1178 0.11779 0.11779
Detection Rate 0.07067 0.1166 0.09776 0.06714
Detection Prevalence 0.08716 0.1496 0.12367 0.08481
Balanced Accuracy 0.87066 0.9763 0.90031 0.77499
Code
# Precision for each class
cat("\nPrecision for each class:\n")

# Save the models
saveRDS(model.svmRadial, "models/model_svmRadial.rds")
saveRDS(model.svmRadial.cv, "models/model_svmRadial_cv.rds")
3.8.3 Polynomial SVM
Out of all tested models, the polynomial SVM achieved the highest overall accuracy with 83.86% and the best performance for recall, precision, and F1-score. Its per-class performance is significantly better than that of the other models. The Cohen's Kappa value of 0.81 indicates high agreement between the predictions and the ground truth labels. The p-value indicates that the accuracy of the polynomial SVM model is significantly better than the no-information rate.
Code
set.seed(100)

# Build training model
model.svmPoly <- train(transport_mode ~ .,
                       data = TrainingSet,
                       method = "svmPoly",
                       na.action = na.omit,
                       preProcess = c("scale", "center"),
                       trControl = trainControl(method = "none"),
                       tuneGrid = data.frame(degree = 3, scale = 0.1, C = 4))

# Build CV model (long processing)
TrainingSet$transport_mode <- as.character(TrainingSet$transport_mode)
model.svmPoly.cv <- train(transport_mode ~ .,
                          data = TrainingSet,
                          method = "svmPoly",
                          na.action = na.omit,
                          preProcess = c("scale", "center"),
                          trControl = fitControl,
                          tuneGrid = data.frame(degree = 3, scale = 0.1, C = 4))  # Fit model

(model.svmPoly.cv$bestTune)
degree scale C
1 3 0.1 4
Code
# Make predictions
model.svmPoly.training <- predict(model.svmPoly, TrainingSet)
model.svmPoly.testing <- predict(model.svmPoly, TestingSet)

# Make predictions from the cross-validated model
model.svmPoly.cv.training <- predict(model.svmPoly.cv, TrainingSet)
model.svmPoly.cv.testing <- predict(model.svmPoly.cv, TestingSet)

# Model performance
model.svmPoly.training.confusion <- confusionMatrix(model.svmPoly.training, as.factor(TrainingSet$transport_mode))
model.svmPoly.testing.confusion <- confusionMatrix(model.svmPoly.testing, as.factor(TestingSet$transport_mode))
model.svmPoly.cv.confusion <- confusionMatrix(model.svmPoly.cv.training, as.factor(TrainingSet$transport_mode))
(model.svmPoly.cv.testing.confusion <- confusionMatrix(model.svmPoly.cv.testing, as.factor(TestingSet$transport_mode))) # Print test run with CV
Confusion Matrix and Statistics
Reference
Prediction Bike Boat Bus Car Horse Other Train Tram Walk
Bike 79 0 19 13 0 3 0 6 1
Boat 0 70 0 0 0 0 0 0 0
Bus 8 0 59 5 0 2 2 1 2
Car 1 0 6 80 0 1 0 0 2
Horse 1 0 0 1 99 3 0 0 25
Other 5 0 2 0 1 60 0 0 5
Train 0 0 0 0 0 0 98 1 0
Tram 6 0 10 0 0 3 0 90 5
Walk 0 0 4 1 0 7 0 2 60
Overall Statistics
Accuracy : 0.8186
95% CI : (0.791, 0.844)
No Information Rate : 0.1178
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7956
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse
Sensitivity 0.79000 1.00000 0.59000 0.80000 0.9900
Specificity 0.94393 1.00000 0.97330 0.98665 0.9599
Pos Pred Value 0.65289 1.00000 0.74684 0.88889 0.7674
Neg Pred Value 0.97115 1.00000 0.94675 0.97365 0.9986
Prevalence 0.11779 0.08245 0.11779 0.11779 0.1178
Detection Rate 0.09305 0.08245 0.06949 0.09423 0.1166
Detection Prevalence 0.14252 0.08245 0.09305 0.10601 0.1519
Balanced Accuracy 0.86696 1.00000 0.78165 0.89332 0.9750
Class: Other Class: Train Class: Tram Class: Walk
Sensitivity 0.75949 0.9800 0.9000 0.60000
Specificity 0.98312 0.9987 0.9680 0.98131
Pos Pred Value 0.82192 0.9899 0.7895 0.81081
Neg Pred Value 0.97552 0.9973 0.9864 0.94839
Prevalence 0.09305 0.1178 0.1178 0.11779
Detection Rate 0.07067 0.1154 0.1060 0.07067
Detection Prevalence 0.08598 0.1166 0.1343 0.08716
Balanced Accuracy 0.87131 0.9893 0.9340 0.79065
Code
# Precision for each class
cat("\nPrecision for each class:\n")

# Save the models
saveRDS(model.svmPoly, "models/model_svmPoly.rds")
saveRDS(model.svmPoly.cv, "models/model_svmPoly_cv.rds")
Since the polynomial SVM showed the best performance, this model is used to predict the transport mode on the full data set, which contains 40'529 data points after preprocessing and threshold filtering. The achieved overall accuracy is 82.1%, with a 95% confidence interval of [81.73%, 82.48%]. Although the full data set is very imbalanced, the unweighted averaged F1-score lies at 80.7%.
Code
# Set seed for reproducibility
set.seed(100)

# Run model on the full data set
model.final <- predict(model.svmPoly.cv, working_dataset_full)

# Create final data frame
working_dataset_result <- data.frame(working_dataset_full, model.final)

# Confusion matrix for the new results
conf_matrix <- confusionMatrix(as.factor(working_dataset_result$transport_mode),
                               as.factor(working_dataset_result$model.final))
cat("Confusion Matrix:\n")
Confusion Matrix:
Code
conf_matrix
Confusion Matrix and Statistics
Reference
Prediction Bike Boat Bus Car Horse Other Train Tram Walk
Bike 1514 3 95 20 33 72 0 87 29
Boat 0 352 0 0 0 0 0 0 0
Bus 481 0 2145 212 28 119 17 474 260
Car 383 10 248 4382 137 115 24 22 30
Horse 6 0 1 18 3657 22 0 0 13
Other 5 0 4 5 13 323 0 21 24
Train 107 0 78 92 0 2 12394 96 12
Tram 168 0 166 13 0 84 46 3790 231
Walk 109 3 234 61 2056 409 22 349 4599
Overall Statistics
Accuracy : 0.8182
95% CI : (0.8144, 0.8219)
No Information Rate : 0.3085
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7797
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse
Sensitivity 0.54598 0.956522 0.72198 0.9123 0.61732
Specificity 0.99102 1.000000 0.95763 0.9729 0.99827
Pos Pred Value 0.81705 1.000000 0.57414 0.8189 0.98386
Neg Pred Value 0.96744 0.999602 0.97755 0.9880 0.93841
Prevalence 0.06843 0.009081 0.07331 0.1185 0.14618
Detection Rate 0.03736 0.008686 0.05293 0.1081 0.09024
Detection Prevalence 0.04572 0.008686 0.09219 0.1320 0.09172
Balanced Accuracy 0.76850 0.978261 0.83981 0.9426 0.80779
Class: Other Class: Train Class: Tram Class: Walk
Sensitivity 0.281850 0.9913 0.78322 0.8848
Specificity 0.998172 0.9862 0.98016 0.9082
Pos Pred Value 0.817722 0.9697 0.84260 0.5865
Neg Pred Value 0.979492 0.9961 0.97088 0.9817
Prevalence 0.028279 0.3085 0.11941 0.1283
Detection Rate 0.007970 0.3058 0.09352 0.1135
Detection Prevalence 0.009747 0.3154 0.11099 0.1935
Balanced Accuracy 0.640011 0.9887 0.88169 0.8965
Code
# Precision for each class
precision <- conf_matrix$byClass[, "Precision"]
cat("\nPrecision for each class:\n")
Precision for each class:
Code
precision
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse Class: Other
0.8170534 1.0000000 0.5741435 0.8189124 0.9838579 0.8177215
Class: Train Class: Tram Class: Walk
0.9697207 0.8425967 0.5864575
Code
# Average precision
avg_precision <- mean(conf_matrix$byClass[, "Precision"])
cat("\nAverage Precision:\n")
Average Precision:
Code
avg_precision
[1] 0.8233848
Code
# Recall for each class
recall <- conf_matrix$byClass[, "Recall"]
cat("\nRecall for each class:\n")
Recall for each class:
Code
recall
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse Class: Other
0.5459791 0.9565217 0.7219791 0.9123465 0.6173194 0.2818499
Class: Train Class: Tram Class: Walk
0.9912821 0.7832197 0.8847634
Code
# Average recall
avg_recall <- mean(conf_matrix$byClass[, "Recall"])
cat("\nAverage Recall:\n")
Average Recall:
Code
avg_recall
[1] 0.7439179
Code
# F1-score for each class
f1_score <- conf_matrix$byClass[, "F1"]
cat("\nF1-Score for each class:\n")
F1-Score for each class:
Code
f1_score
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse Class: Other
0.6545612 0.9777778 0.6396302 0.8631081 0.7586350 0.4192083
Class: Train Class: Tram Class: Walk
0.9803829 0.8118239 0.7053681
Code
# Average F1-score
avg_f1_score <- mean(conf_matrix$byClass[, "F1"])
cat("\nAverage F1-Score:\n")
Average F1-Score:
Code
avg_f1_score
[1] 0.7567217
Code
# Save working_dataset_result as a CSV file
write.csv(working_dataset_result, "data/working_dataset_result.csv", row.names = FALSE)
The resulting class distribution shows that the model predicts too many points as Train. This boosts the model's performance, since the Train class is strongly overrepresented in this data set. Many false classifications were expected between the transport modes Car, Bus, Bike, and Tram, since key parameters such as velocity and acceleration lie in similar ranges and are difficult for the model to distinguish.
Bike Boat Bus Car Horse Other Train Tram Walk
1853 352 3736 5351 3717 395 12781 4498 7842
Code
table(working_dataset_result$model.final)
Bike Boat Bus Car Horse Other Train Tram Walk
2773 368 2971 4803 5924 1146 12503 4839 5198
3.9 Post Processing
To boost the model performance, some simple post-processing is applied. A moving-window function is used to find misclassified points within segments. The function inspects the x neighbours on each side of a point; if a given percentage of these neighbours belongs to one class, the point is reclassified to the majority class of its neighbours. This process can be applied iteratively.
For this data set, a window size of 1, a threshold percentage of 75%, and 3 iterations result in a smoothing of the results, but not necessarily a gain in model accuracy.
Code
# Run a loop to identify outlier points in the classification. If the previous and
# following x points are identical but the middle one is different, it is changed.
# x: number of points considered around the current value in each direction (x*2 neighbours considered)
# threshold_percentage: share of neighbouring points that must agree for the current value to be changed
# iterations: number of iterations
single_point_correction <- function(df, x, threshold_percentage, iterations) {
  # Track the number of points changed
  changed_count <- 0
  for (iter in 1:iterations) {
    for (i in (x + 1):(nrow(df) - x)) {
      current_value <- df$model.final[i]
      # Find x previous & x following values around point i
      previous_values <- df$model.final[(i - x):(i - 1)]
      following_values <- df$model.final[(i + 1):(i + x)]
      # Calculate the number of occurrences of each class in the surrounding points
      class_counts <- table(c(previous_values, following_values))
      # Find the class that occurs most frequently
      most_frequent_class <- names(class_counts)[which.max(class_counts)]
      # Check if the most frequent class exceeds the threshold count
      if (class_counts[most_frequent_class] > threshold_percentage * length(c(previous_values, following_values))) {
        df$model.final[i] <- most_frequent_class
        changed_count <- changed_count + 1
      }
    }
    message("Metrics after each iteration:")
    conf_matrix_func <- confusionMatrix(as.factor(df$transport_mode), as.factor(df$model.final))
    # Precision for each class
    cat("\n Mean Precision\n")
    print(precision_func <- mean(conf_matrix_func$byClass[, "Precision"]))
    # Recall for each class
    cat("\n Mean Recall\n")
    print(recall_func <- mean(conf_matrix_func$byClass[, "Recall"]))
    # F1-score for each class
    cat("\n Mean F1-Score\n")
    print(f1_score_func <- mean(conf_matrix_func$byClass[, "F1"]))
  }
  message("Number of times the condition is true and values are updated: ", changed_count)
  return(df)
}

working_dataset_result_copy <- working_dataset_result
working_dataset_result <- single_point_correction(working_dataset_result, 10, 0.75, 3)
Mean Precision
[1] 0.8541377
Mean Recall
[1] 0.7786285
Mean F1-Score
[1] 0.791695
Mean Precision
[1] 0.8548945
Mean Recall
[1] 0.7794413
Mean F1-Score
[1] 0.792599
Mean Precision
[1] 0.8552145
Mean Recall
[1] 0.7796972
Mean F1-Score
[1] 0.7928828
Code
# Confusion matrix for the new results
conf_matrix_2 <- confusionMatrix(as.factor(working_dataset_result$transport_mode),
                                 as.factor(working_dataset_result$model.final))
cat("Confusion Matrix:\n")
Confusion Matrix:
Code
conf_matrix_2
Confusion Matrix and Statistics
Reference
Prediction Bike Boat Bus Car Horse Other Train Tram Walk
Bike 1663 0 32 10 34 65 0 36 13
Boat 0 352 0 0 0 0 0 0 0
Bus 460 0 2276 212 28 102 15 392 251
Car 357 2 191 4528 128 83 14 18 30
Horse 2 0 0 3 3704 6 0 0 2
Other 4 0 2 5 13 343 0 14 14
Train 85 0 38 59 0 2 12563 24 10
Tram 75 0 122 6 0 47 12 4012 224
Walk 92 2 214 49 2081 306 20 339 4739
Overall Statistics
Accuracy : 0.8434
95% CI : (0.8399, 0.847)
No Information Rate : 0.3115
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.81
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse
Sensitivity 0.60738 0.988764 0.79165 0.9294 0.61857
Specificity 0.99497 1.000000 0.96122 0.9769 0.99962
Pos Pred Value 0.89746 1.000000 0.60921 0.8462 0.99650
Neg Pred Value 0.97220 0.999900 0.98372 0.9902 0.93795
Prevalence 0.06756 0.008785 0.07094 0.1202 0.14776
Detection Rate 0.04104 0.008686 0.05616 0.1117 0.09140
Detection Prevalence 0.04572 0.008686 0.09219 0.1320 0.09172
Balanced Accuracy 0.80117 0.994382 0.87644 0.9532 0.80910
Class: Other Class: Train Class: Tram Class: Walk
Sensitivity 0.359539 0.9952 0.8298 0.8970
Specificity 0.998686 0.9922 0.9864 0.9120
Pos Pred Value 0.868354 0.9829 0.8920 0.6043
Neg Pred Value 0.984774 0.9978 0.9772 0.9834
Prevalence 0.023541 0.3115 0.1193 0.1304
Detection Rate 0.008464 0.3100 0.0990 0.1169
Detection Prevalence 0.009747 0.3154 0.1110 0.1935
Balanced Accuracy 0.679112 0.9937 0.9081 0.9045
Code
# Precision for each class
precision_2 <- conf_matrix_2$byClass[, "Precision"]
cat("\nPrecision for each class:\n")
Precision for each class:
Code
precision_2
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse Class: Other
0.8974636 1.0000000 0.6092077 0.8461970 0.9965026 0.8683544
Class: Train Class: Tram Class: Walk
0.9829434 0.8919520 0.6043101
Code
# Average precision
avg_precision_2 <- mean(conf_matrix_2$byClass[, "Precision"])
cat("\nAverage Precision:\n")
Average Precision:
Code
avg_precision_2
[1] 0.8552145
Code
# Recall for each class
recall_2 <- conf_matrix_2$byClass[, "Recall"]
cat("\nRecall for each class:\n")
Recall for each class:
Code
recall_2
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse Class: Other
0.6073776 0.9887640 0.7916522 0.9293924 0.6185705 0.3595388
Class: Train Class: Tram Class: Walk
0.9951679 0.8297828 0.8970282
Code
# Average recall
avg_recall_2 <- mean(conf_matrix_2$byClass[, "Recall"])
cat("\nAverage Recall:\n")
Average Recall:
Code
avg_recall_2
[1] 0.7796972
Code
# F1-score for each class
f1_score_2 <- conf_matrix_2$byClass[, "F1"]
cat("\nF1-Score for each class:\n")
F1-Score for each class:
Code
f1_score_2
Class: Bike Class: Boat Class: Bus Class: Car Class: Horse Class: Other
0.7244609 0.9943503 0.6885494 0.8858456 0.7633179 0.5085248
Class: Train Class: Tram Class: Walk
0.9890179 0.8597450 0.7221333
Code
# Average F1-score
avg_f1_score <- mean(conf_matrix_2$byClass[, "F1"])
cat("\nAverage F1-Score:\n")
Average F1-Score:
Code
avg_f1_score
[1] 0.7928828
After the post-processing there was no significant difference in the class distribution.
Code
# Show class distributions; note that model.final holds the classification
# and transport_mode the ground truth, so the titles are assigned accordingly
final_classes <- ggplot(working_dataset_result) +
  geom_bar(aes(x = model.final)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "transparent", color = NA)) +
  ylim(c(0, 14000)) +
  ggtitle("Class Distribution After \nPost Processing") +
  xlab("Transport Mode")

classes <- ggplot(working_dataset_result) +
  geom_bar(aes(x = transport_mode)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "transparent", color = NA)) +
  ylim(c(0, 14000)) +
  ggtitle("Ground Truth Class Distribution") +
  xlab("Transport Mode")

grid.arrange(final_classes, classes, nrow = 1)
Code
cat("Ground Truth\n")
Ground Truth
Code
table(working_dataset_result$transport_mode)
Bike Boat Bus Car Horse Other Train Tram Walk
1853 352 3736 5351 3717 395 12781 4498 7842
Code
cat("\nClassification \n")
Classification
Code
table(working_dataset_result$model.final)
Bike Boat Bus Car Horse Other Train Tram Walk
2738 356 2875 4872 5988 954 12624 4835 5283
The map below allows a comparison between wrongly classified points and the ground truth, to explore where the model fails.
The preprocessing of GPS data plays a crucial role in the classification results. While individually computed parameters such as velocity, acceleration, and sinuosity provide valuable information, they are insufficient to construct a robust model on their own. Applying moving-window functions to these parameters, however, can greatly enhance the accuracy of the model.6,7
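As a minimal illustration of such a moving-window function (the function name and window logic are our own sketch, not the project code), a centred rolling mean damps single-point noise in a computed parameter such as velocity:

```r
# Centred moving-window mean: each value is replaced by the mean of its
# neighbours within k positions on either side (window truncated at the edges)
rolling_mean <- function(x, k) {
  n <- length(x)
  sapply(seq_len(n), function(i) {
    idx <- max(1, i - k):min(n, i + k)
    mean(x[idx])
  })
}

v <- c(1, 1, 10, 1, 1)  # a single GPS speed spike
rolling_mean(v, 1)      # spike is damped: 1 4 4 4 1
```

Larger windows smooth more aggressively but also blur genuine transitions between transport modes, which is why the window size itself is worth tuning.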
To effectively differentiate between similar classes like buses, trams, cars, bikes, and boats, additional parameters need to be considered. For instance, incorporating the distance to public traffic networks specific to each transport mode can significantly improve the accuracy of the model. These additional parameters provide valuable contextual information that aids in distinguishing between similar classes.
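A sketch of how such a distance feature could be derived with the sf package; the geometries below are synthetic stand-ins for real GPS points and a transit-network layer:

```r
library(sf)

# Two synthetic GPS points and a straight "tram line" along the x-axis
pts  <- st_as_sf(data.frame(x = c(0, 1), y = c(0, 1)),
                 coords = c("x", "y"))
tram <- st_sfc(st_linestring(rbind(c(0, 0), c(2, 0))))

# Nearest distance from each point to the line, usable as a model feature
dist_tram <- as.numeric(st_distance(pts, tram))
dist_tram  # 0 for the point on the line, 1 for the point one unit away
```

With real data, one such column per network layer (tram, bus, rail, water) would be joined to the tracking points before training.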
In urban settings, distinguishing between bus, tram, and car travel poses a particular challenge due to the characteristic stop-and-go movement patterns. The frequent fluctuations between low velocities and accelerations make it difficult to discern the specific class. These movement patterns can correspond to multiple classes and create ambiguity in the classification process. By addressing these challenges model accuracy can be enhanced.
5. Discussion
In order to enhance the overall classification accuracy, it is crucial to adopt a more strategic approach to testing various parameters and their impact on the classification results. This includes exploring different preprocessing techniques, employing diverse models, and implementing appropriate post-processing steps. Specifically, the moving-window size, which imparts a smoothing effect on computed parameters, and the hyperparameters of the SVM models could benefit from further refinement given increased computational power.
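One way to make that refinement more systematic is to search over a full hyperparameter grid instead of a single tuning row; the grid values below are illustrative examples, not settings used in this project:

```r
# Illustrative hyperparameter grid for the polynomial SVM
# (values are examples, not the settings used in this project)
tune_grid <- expand.grid(degree = 2:4,
                         scale  = c(0.01, 0.1, 1),
                         C      = c(1, 4, 16))
nrow(tune_grid)  # 27 candidate combinations

# With caret, cross-validation would then pick the best row, e.g.:
# fit_control <- trainControl(method = "cv", number = 5)
# model <- train(transport_mode ~ ., data = TrainingSet,
#                method = "svmPoly", preProcess = c("center", "scale"),
#                trControl = fit_control, tuneGrid = tune_grid)
# model$bestTune
```

The cost grows multiplicatively with each added grid dimension, which is why the refinement is tied to available computational power.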
In related studies on transport mode detection, segmentation has been successfully applied to the data.8,9 In this context, point data was utilized to investigate whether the classification model could autonomously identify distinct segments. Preliminary results suggest that the model often identifies segments, but further analysis is necessary to validate these findings. Furthermore, Biljecki et al.10 proposed categorizing transport modes into land, water, and air travel and classifying each group individually. This approach was not implemented here; however, by incorporating distance-to-water calculations, boat travel can be identified within the same model as land travel.
To improve the quality of GPS data, there are several potential avenues to explore. One approach is to employ a shorter sampling interval, allowing more frequent data points to be captured. Additionally, supplementing GPS data with accelerometer data, as demonstrated by Roy et al.,11 has been shown to enhance model performance, reaching accuracies of approximately 90%.
Ultimately, the quality of the GPS data is the key weakness of our model. Despite these data-quality limitations, it is shown that transport modes can be classified with approximately 85% accuracy.
References
Code
wordcountaddin::text_stats("index.qmd")
Method            koRpus         stringi
Word count        2955           2886
Character count   19190          19233
Sentence count    197            Not available
Reading time      14.8 minutes   14.4 minutes
Amt für Geoinformation, Kt. Bern. “Öffentlicher Verkehr,” 2023.
Amt für Raumentwicklung, Kt. Zürich. “Linien des öffentlichen Verkehrs,” 2022.
Biljecki, Filip, Hugo Ledoux, and Peter van Oosterom. “Transportation Mode-Based Segmentation and Classification of Movement Trajectories.” International Journal of Geographical Information Science 27 (February 2013): 385–407. doi:10.1080/13658816.2012.692791.
Federal Office of Topography swisstopo. “swissTLM3D,” 2023.
Federal Office of Transport FOT. “Öffentlicher Verkehr,” 2023.
Roy, Avipsa, Daniel Fuller, Kevin Stanley, and Trisalyn Nelson. “Classifying Transport Mode from Global Positioning Systems and Accelerometer Data: A Machine Learning Approach.” Findings, September 2020. doi:10.32866/001c.14520.